Abstract — Visual Human Intent Analysis (VHIA) is a growing field
of research devoted to algorithms that categorize human behavior
from sequences of visual images. In VHIA, it is difficult to
build an implementation framework robust enough to handle
the many non-deterministic ways in which humans perform physical
actions. This paper surveys several techniques currently used in
VHIA, including non-traditional artificial intelligence techniques,
visual languages, statistical algorithms, and others. It gives the
reader the background needed to understand how open this domain is
to new algorithms that could improve research in VHIA or in closely
related problems.

Keywords —  Activity  recognition,  artificial  intelligence,  visual  human 
intent  analysis,  visual  understanding. 


I.  Introduction 

As the technologies for developing Visual Understanding (VU)
systems move toward full commercial viability,
the desire to take VU algorithms out of the research labs and
into real-world applications is growing. One research sub-area
of VU is Visual Human Intent Analysis (VHIA), also referred
to as visual human behavior identification, visual action or
activity recognition, and understanding human actions from
visual cues. In static self security systems, visual human
behavior identification systems will aid or replace security
guards monitoring CCTV feeds. Television stations will
require activity recognition systems to automatically
categorize and store video scenes in a database for quick
search and retrieval. The military is pushing robotics to
replace the soldier, which requires understanding
human actions from visual cues in order to detect hostile actions
by people, so that the robot can take appropriate action to secure
itself and its mission. These are just a few areas where VHIA
will advance the current state of the art in the development and
use of future systems. This paper surveys current research in
the area of Visual Human Intent Analysis.

II.  NON-TRADITIONAL  TECHNIQUES 

A good deal of the work in VHIA uses visual cues of the
human actions without any traditional computational
intelligence techniques [10, 12, 15, 17, 23, 25, 27, 31, 33, 34,
37, 43, 46, 47, 49, 50, 52, 55, 61, 62, 63]. These algorithms
favor simple classifiers, at the cost of either fusing several
input data sources or working with fewer data inputs than is
typical. They rely almost exclusively on the pre-processing of
the data while using statistical or non-traditional AI algorithms
to determine the behavior. For example, M. Cristani et al. [15]
use both audio and visual data to determine simple events in an
office. First, they remove foreground objects and segment the
images in the sequence. This output is coupled with the audio
data, and a threshold detection process is used to identify
unusual events. These event sequences are put into an audio
visual concurrence (AVC) matrix for comparison with known AVC
events.

M. S. Del Rose is with the U.S. Army Tank Automotive Research
Development and Engineering Center (TARDEC), Warren, MI 49397 USA
(Desk: 596.306.4059; email: mike.delrose@us.army.mil).

C. C. Wagner is with Oakland University, Rochester MI, 48309, USA
(Desk: 248.370.2215; wagner@oakland.edu).

Many systems use their own form of plotting space-time
data from the image sequences and then calculate the closest
distance to pre-determined or automatically determined events
to decide what action the human is performing [10, 12, 16, 17,
19, 23, 25, 27, 30, 31, 34, 35, 43, 50, 55, 59]. For example,
M. Dimitrijevic et al. [17] developed a template database of
actions based on five male and three female individuals. Each
human action is represented by a set of three frames of the
person’s 2D silhouette: the frame when the person first touches
the ground with one foot, the frame at the midstride of
the step, and the end frame when the person finishes touching
the ground with the same foot. The three-frame sets were
taken from seven camera positions. When determining the
event, they use a modified Chamfer distance calculation to
match against the template sequences in the database.
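As a minimal sketch of the matching step, the code below computes a plain (unmodified) directed Chamfer distance between two sets of silhouette edge points and picks the closest template. It is not the modified calculation of [17]; the point sets, the labels "walk" and "run", and the function names are hypothetical and chosen only for illustration.

```python
import math

def chamfer_distance(template, image):
    """Mean distance from each template point to its nearest image point."""
    if not template or not image:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for tx, ty in template:
        # Nearest image point to this template point (brute force).
        total += min(math.hypot(tx - ix, ty - iy) for ix, iy in image)
    return total / len(template)

def best_template(templates, image):
    """Return the label of the template with the smallest Chamfer distance."""
    return min(templates, key=lambda label: chamfer_distance(templates[label], image))

# Hypothetical edge points; real systems would extract them from silhouettes.
templates = {
    "walk": [(0, 0), (1, 0), (2, 0)],
    "run": [(0, 0), (1, 2), (2, 4)],
}
observed = [(0, 0), (1, 0), (2, 1)]
match = best_template(templates, observed)
```

A practical system would usually precompute a distance transform of the image edges instead of the brute-force nearest-point search used here.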

In another example, D. Weinland et al. [10] use motion
history volumes to determine human gestures by extending the
2D pixel representation with time to make a 3D
representation. This is accomplished by using multiple
cameras around the person and subtracting out any
background. Classes are created manually for each action or
gesture. Mahalanobis distance with principal component
analysis is used to assign an action to the appropriate class.

The work of A. Mokhber et al. [23] uses volumetric
models to determine human actions. Features representative
of the action sequence are extracted from the binary images
and used to compute the space-time volumes, so that
people are described globally. The volumes are formed from all
the binary images in the action, concatenated together in
chronological order. Moments of these volumes make up the
feature vector. Testing data is assigned to classes
formed from the training data by Mahalanobis distance.
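The Mahalanobis comparison can be illustrated with a short sketch. It assumes two-dimensional feature vectors so that the covariance matrix can be inverted in closed form; the class names "wave" and "kick" and the sample values are hypothetical, and the surveyed systems use higher-dimensional features.

```python
def mean_and_covariance(samples):
    """Sample mean and covariance (sxx, sxy, syy) of 2D feature vectors."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_squared(point, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance matrix is not invertible")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def classify(point, training):
    """Assign the point to the class with the smallest Mahalanobis distance."""
    stats = {label: mean_and_covariance(s) for label, s in training.items()}
    return min(stats, key=lambda label: mahalanobis_squared(point, *stats[label]))

# Hypothetical training features for two manually defined classes.
training = {
    "wave": [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],
    "kick": [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)],
}
label = classify((1.0, 2.0), training)
```

Unlike Euclidean distance, this measure scales each direction by the spread of the class, so a class with widely varying features tolerates larger deviations.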

M. Rahman et al. [25] use an eigenspace to help categorize
actions so that events can be identified by distance computations.
They argue that eigenspace methods should be used since they are
highly mathematical and require less image processing. Motion from
a camera is captured, manually placed into classes, and
then finally placed into a covariance matrix. This makes up the
universal image set for that action. Eigenvalues and
eigenvectors are calculated and, using the Karhunen-Loeve
technique, the ones that best describe the action are kept. These
make up an orthogonal coordinate system. To recognize the
behavior in an unknown image sequence, the sequence is projected
onto this coordinate system and a distance measure is used
to compare it with the known classes. This study looked at five
different cricket events with fairly good recognition.
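The eigenspace idea can be sketched as follows, under simplifying assumptions that are not from [25]: only the single dominant eigenvector is kept, it is found by power iteration rather than a full eigendecomposition, and the class names "bowl" and "bat" and the 2D feature values are hypothetical.

```python
import math

def mean_and_covariance(samples):
    """Mean vector and sample covariance matrix of equal-length vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

def principal_axis(cov, iterations=50):
    """Dominant eigenvector by power iteration (start vector is all ones,
    which fails only if it is orthogonal to the dominant eigenvector)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variance to project onto")
        v = [x / norm for x in w]
    return v

def project(sample, mean, axis):
    """Coordinate of a mean-centered sample along the kept axis."""
    return sum((s - m) * a for s, m, a in zip(sample, mean, axis))

def classify(sample, class_samples):
    """Nearest class center in the one-dimensional eigenspace."""
    pooled = [s for group in class_samples.values() for s in group]
    mean, cov = mean_and_covariance(pooled)
    axis = principal_axis(cov)
    target = project(sample, mean, axis)

    def center(label):
        coords = [project(s, mean, axis) for s in class_samples[label]]
        return sum(coords) / len(coords)

    return min(class_samples, key=lambda label: abs(center(label) - target))

# Hypothetical features for two manually labeled classes.
training = {"bowl": [(0.0, 0.0), (1.0, 1.0)], "bat": [(4.0, 4.0), (5.0, 5.0)]}
label = classify((4.0, 5.0), training)
```

Keeping more than one eigenvector would follow the same pattern, with one projection coordinate per retained axis.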


III. TRADITIONAL ARTIFICIAL INTELLIGENCE
TECHNIQUES

Some traditional artificial intelligence techniques are used
for identifying intent, many with a focus on specific
motions [13, 19, 20, 26, 28, 35, 40, 42, 45, 58, 63]. For
instance, J.-Y. Yang et al. [19] use neural networks to
determine human actions. Their approach differs from the usual
human motion capture practice of placing tags on body parts for
tracking. They strap a tri-axial accelerometer to the person’s
wrist to detect three degrees of motion on a specific body part.
Tri-axial data is captured at set time intervals and processed to
feed into a neural network that determines whether the event is
static, like standing or sitting, or dynamic, like walking or
running. Once this is determined, either a static event neural
network or a dynamic event neural network is used to
determine the action. The results are promising for the limited
set of actions detected (standing, sitting, walking, running,
vacuuming, scrubbing, brushing teeth, and working on a
computer).

Rule-based and fuzzy systems are other types of
computational intelligence techniques used to identify patterns,
and they have been adapted to analyze human events [13, 16, 47].
H. Stern et al. [16] created a prototype fuzzy system for
picture understanding of surveillance cameras. Their model is
split into three parts: a pre-processing module, a static object
fuzzy system module, and a dynamic temporal fuzzy system
module. The static fuzzy system module takes in the
pre-processed data and outputs the number of people in the image:
single person, two people, three people, many people, or no
people. Then the dynamic fuzzy system determines the intent
of the person based on the temporal movements.
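The static module's output categories lend themselves to a small fuzzy membership sketch. The code assumes a single hypothetical input, a continuous people-count estimate (for example, foreground area divided by a typical person area), and hypothetical triangular membership functions; the actual inputs and rules of [16] are not given here.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def crowd_memberships(count_estimate):
    """Degree of membership of the estimate in each output category."""
    return {
        "no people": triangular(count_estimate, -1.0, 0.0, 1.0),
        "single person": triangular(count_estimate, 0.0, 1.0, 2.0),
        "two people": triangular(count_estimate, 1.0, 2.0, 3.0),
        "three people": triangular(count_estimate, 2.0, 3.0, 4.0),
        # Right-shoulder set: ramps up from 3 and stays at 1 from 4 onward.
        "many people": min(1.0, max(0.0, count_estimate - 3.0)),
    }

def crowd_label(count_estimate):
    """Defuzzify by taking the category with the highest membership."""
    memberships = crowd_memberships(count_estimate)
    return max(memberships, key=memberships.get)

label = crowd_label(1.3)  # partly "single person", partly "two people"
```

The overlap between neighboring sets is what lets a noisy estimate such as 1.3 belong partly to two categories before a final label is chosen.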


IV.  MARKOV  MODELS  AND  BAYESIAN  NETWORKS 

Developing Markov or Bayesian networks is a common
approach in visual human behavior analysis research. These
methods seem to fit the logical view of an action as a sequence
of images. There are several ways to
develop a network from the input data [1, 8, 14, 18, 19, 30, 36,
38, 42, 46, 48, 57]. Du, Chen, and Xu [1]
use a Coupled Hierarchical Durational-State Dynamic
Bayesian Network (CHDS-DBN) to model human actions.
They claim that, to understand human actions, a framework
should capture both the motion corresponding to the interaction
and the details of the motion at different scales. For the most
part, earlier research did not include interaction with other
people as a determinant in understanding the intent of a single
person; most work addresses the motion characteristics of the
individual alone. This approach adds a decision base to normal
action understanding.


The work by A. Galata et al. [8] uses variable length
Markov models for human understanding. They claim that
variable length Markov models provide a more efficient
way to represent behaviors, and more flexibility than common
models, when using a large temporal scale.
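The variable-length idea can be sketched with a simplified predictor that backs off from the longest previously seen context to shorter ones. This is an illustration only: it stores every context up to a fixed maximum order rather than pruning contexts as a full variable length Markov model would, and the pose labels are hypothetical.

```python
from collections import Counter, defaultdict

def train(sequence, max_order=3):
    """Count which symbol follows each context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i in range(len(sequence)):
        for order in range(max_order + 1):
            if i - order < 0:
                break
            context = tuple(sequence[i - order:i])
            counts[context][sequence[i]] += 1
    return counts

def predict(counts, history, max_order=3):
    """Predict the next symbol from the longest context seen in training."""
    for order in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - order:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None

# Hypothetical sequence of quantized pose labels.
poses = ["stand", "walk", "run", "walk", "stand", "walk", "run"]
model = train(poses)
after_walk = predict(model, ["walk"])             # first-order context only
after_run_walk = predict(model, ["run", "walk"])  # longer context changes the answer
```

Here the longer context disambiguates what follows "walk", which a fixed first-order model cannot do.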


V.  GRAMMARS 

In constructing networks, many researchers use grammars
[5, 22, 37, 41, 44, 50, 52, 56, 57, 58, 59] to describe the
sequence of events the body makes, which helps determine the
actions of the person visually. Grammars are mathematically
based and seem to suit action understanding
because several frames are used in a network fashion. A.
Ogale et al. [22] use probabilistic context free grammars
(PCFGs) on short action sequences of a person from video.
Body poses are stored as silhouettes, which are used in the
construction of the PCFG. Pairs of frames are constructed
based on their time slot: the poses from frames 1 and 2 are
paired, the poses from frames 2 and 3 are paired, and so on.
These pairs construct the PCFG for the given action. When
testing the algorithm, the same procedure is followed.
The testing data is compared with the trained data using a
Bayesian approach.
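The frame-pairing step and a Bayesian comparison can be sketched as below. This is a pose-pair (bigram) model with add-one smoothing and uniform priors, used here as a simplified stand-in; it is not a full PCFG and not the procedure of [22]. The action names and pose labels are hypothetical.

```python
import math
from collections import Counter

def pose_pairs(poses):
    """Pair each pose with its successor: (1, 2), (2, 3), and so on."""
    return list(zip(poses, poses[1:]))

def train(examples):
    """examples maps an action to its pose sequences; returns pair counts."""
    return {action: Counter(pair for seq in seqs for pair in pose_pairs(seq))
            for action, seqs in examples.items()}

def log_posterior(pairs, counts, prior, n_pair_types):
    """log P(action) + sum of log P(pair | action), add-one smoothed."""
    total = sum(counts.values())
    score = math.log(prior)
    for pair in pairs:
        score += math.log((counts[pair] + 1) / (total + n_pair_types))
    return score

def classify(poses, models, n_poses):
    """Pick the action with the highest posterior under uniform priors."""
    pairs = pose_pairs(poses)
    prior = 1.0 / len(models)
    n_pair_types = n_poses * n_poses
    return max(models, key=lambda a: log_posterior(pairs, models[a], prior, n_pair_types))

# Hypothetical training data with three distinct pose labels.
examples = {
    "sit": [["stand", "bend", "seated"]],
    "squat": [["stand", "bend", "stand"]],
}
models = train(examples)
action = classify(["stand", "bend", "seated"], models, n_poses=3)
```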

The work of Starner, Weaver, and Pentland [5] also uses
grammars in constructing the network. Phrase grammars
were used to distinguish the type of action from hand signals
that can be networked together to form the meaning of a
sentence signed in American Sign Language. In this case,
phrase grammars limit the search set of words to improve the
accuracy of what is being described. They also speed up the
process compared with not using grammars. A Hidden Markov Model
is used to train and test the data.


VI.  TRADITIONAL  HIDDEN  MARKOV  MODELS 

Of all the VHIA networks constructed, Hidden Markov
Models (HMMs) are the most widely used [3, 4, 5, 7, 8, 9, 11,
20, 21, 24, 26, 38, 45, 48, 51, 64]. Yamato et al. [3] used
HMMs to recognize six different tennis strokes by creating a
feature vector from a 25x25 mesh overlaid on the image to help
describe body positions in each frame. Campbell, Becker, and
Azarbayejani [4] used HMMs to recognize eighteen Tai Chi
moves. Each move was represented by a series of vectors
formed from the 3D positions of the head and the hands. Yu and
Ballard [9] use HMMs to distinguish similar actions based on
head and eye movements. The work of Lee and Kim [7]
shows how to use HMMs for gesture recognition. A threshold
model is built to provide dynamic threshold values that
distinguish between meaningful and meaningless gestures.
HMMs are used to train and test the predefined gestures.
Gehrig and Schulz [24] used HMMs to recognize ten kitchen
actions based on the movement of twenty-four points on the
upper body. They examined skeletal data and the computed
movements of people to reduce the number of body
parts to thirteen. Gao et al. [26] used both Optical Flow
Tensors (OFTs) and HMMs to distinguish basketball shot
actions from video. Optical flow fields are modeled from the
video frames at several resolutions and assembled into a tensor,
the OFT. The dimensionality of the data is reduced
by first applying a general Tensor Discriminant
Analysis function and then a Linear Discriminant Analysis
function. An HMM is used to train and test the final features.
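A common way to use HMMs for recognition is to keep one model per action and choose the model that assigns the observed sequence the highest likelihood. The sketch below does this with the standard forward algorithm for discrete observations. The two stroke names, the two-state topology, and all probabilities are hypothetical values chosen for illustration, not parameters from any of the cited systems.

```python
def forward_likelihood(observations, start, trans, emit):
    """P(observations | model) by the forward algorithm (discrete HMM)."""
    n = len(start)
    # Initialization: start probability times emission of the first symbol.
    alpha = [start[s] * emit[s][observations[0]] for s in range(n)]
    # Induction: propagate through transitions, then weight by the emission.
    for symbol in observations[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
                 for s in range(n)]
    return sum(alpha)

def classify(observations, models):
    """Return the action whose HMM gives the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(observations, *models[name]))

# Hypothetical left-to-right models: (start, transition, emission).
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "forehand": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "backhand": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}
codebook_sequence = [0, 0, 1, 1]  # quantized feature vectors, one per frame
stroke = classify(codebook_sequence, models)
```

For long sequences the forward variables underflow, so practical implementations rescale them at each step or work with log probabilities.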


VII.  NON-TRADITIONAL  HIDDEN  MARKOV  MODELS 

Other forms of HMMs have been developed to handle more
specific problems with HMM-based action recognition
systems [2, 6, 11, 14, 18, 20, 21, 28, 39, 51, 54, 60, 66].
Wilson and Bobick [6] use a Parametric Hidden Markov
Model (PHMM) to recognize gestures. The PHMM has an
additional parameter used to represent meaningful variations
of gestures across the set of all gestures. This gives PHMMs
the ability to distinguish between gesture meanings with
similar hand movements.

Oliver et al. [11] developed a real-time system that detects
and classifies interactions between people using a Coupled
Hidden Markov Model (CHMM). They used synthetic
environments to model person-to-person interactions and thereby
create their CHMM. Data from a static camera was used,
and moving objects were segmented and tracked. Data about
each person’s location, heading, and location relative to other
people was fed into the synthetically created CHMM for
analysis of the type of interaction. Results show that the CHMM
outperformed standard HMMs. This is not surprising: standard
HMMs work on single automatons whereas CHMMs
work on coupled automatons, so standard HMMs cannot be expected
to outperform CHMMs in this environment.

Multi-Observation Hidden Markov Models (MOHMMs) are
discussed in both [28] and [14] by Xiang and Gong. In [28]
they use MOHMMs to create breakpoints in video content of
activity. Blobs above a certain threshold in each frame are
segmented from the pixel change history. Several functions of
these blobs are used in the feature vector to classify the video
with the MOHMM. In [14] an MOHMM was used to detect
piggybacking at a security door, which is when someone
follows another person through the door without using
his/her own card. The framework of the system allowed for
continual changes based on changes in people’s movements,
so unsupervised learning is used to continually update the
model.

Gong and Xiang [18] developed a dynamic multi-linked
HMM (DML-HMM) to recognize group activity in an
outdoor scene. The DML-HMM is based on salient dynamic
interlinks among multiple temporal events using Dynamic
Probabilistic Networks (DPNs). Standard HMMs cannot take
into account the multiple processes needed. The DML-HMM
was built to handle the multitude of different object events.
The topology is determined by the causality and temporal
order, and is built automatically using factorization based on
the Schwarz Bayesian Information Criterion. They claim that
instead of being fully connected like Coupled HMMs
(CHMMs), the DML-HMM aims to connect only a subset of
relevant hidden state variables across multiple temporal
processes. In a comparison with a Multi-Observation
HMM (MOHMM), a Parallel HMM (PaHMM), and a
CHMM, the DML-HMM performs better, since the CHMM
and the MOHMM propagate noise through the system
and the PaHMM discards correlations between multiple
temporal processes.

Continuous HMMs (cHMMs) are used in the work of
Antonakaki et al. [20]. Their work classifies abnormal
behavior of people based on both the short-term behavior of
the people and their trajectories. A short-term
behavior is a behavior that can be classified in twenty-five
frames, or one second. A one-class support vector machine
(SVM) is used to distinguish abnormal behavior in the
short-term behavior sequence. For trajectory data, a one-class
cHMM is used to determine if the person’s movement is
abnormal. Both are used to determine the final result.

Layered Hidden Markov Models (LHMMs) are used by
Oliver et al. [21] to detect people’s activity in an office. They
employ a two-level cascade of HMMs with three processing
layers. The first layer captures video, audio, and
keyboard/mouse activity and creates a feature vector. The
middle layer has two HMMs, one for creating an audio feature
vector and one for creating a video feature vector. The top
layer uses the results of these HMMs along with
keyboard/mouse activity and the derivative of the sound
localization component as its feature vector. The results of the
top layer determine the activity in the office. They claim the
layered HMM makes it feasible to decouple different levels of
analysis for training and inference. A single HMM
would need a large parameter space and thus a large
amount of data to train.

Liu and Chua [2] use an Observation Decomposed Hidden
Markov Model (ODHMM) to model and classify multi-person
activity. They state that automatically recognizing the activity
of multiple agents (persons, extremities, or objects) is very
challenging due to the complexity of interactions between agents.
This complexity stems from the large dimensionality of the feature
vectors and the complex mapping of agents from input data to
pre-defined activity models. To handle this problem, they
decompose each feature vector into a set of sub-feature vectors
for the ODHMM.

VIII. SUMMARY OF VISUAL HUMAN INTENT
ANALYSIS DEVELOPMENT

Across  all  the  previous  research,  several  common 
requirements  stand  out. 

- Detect and segment objects in each frame

- Determine relative position and orientation of each object

- Identify a meaningful sequence of frames from the visual input

- Store and retrieve past sequences of behaviors for
identification of current ones

To detect and segment out objects in each frame, there
needs to be a well-developed set of image processing tools. If
the goal is to identify arm/hand movements, as in [5], to
classify American Sign Language words from visual
identification of the hands, then the image processing is very
important. Any misprocessing of the input data that goes into
the classifier may cause inaccuracy.


UNCLASSIFIED:  Dist  A.  public  release 




Determining  the  relative  position  and  orientation  of  the 
object  also  requires  good  image  processing  techniques.  In 
some  instances,  just  the  movement  is  important  and  not  the 
exact  location  from  frame  to  frame,  as  in  [11].  This  case 
requires  less  analysis  of  the  processed  input  data  into  the 
classifier  than,  say,  [10]  where  body  orientation  is  important  to 
match  with  preselected  human  actions  from  several  angles. 

To identify a meaningful sequence of frames, boundaries should
be placed before and after each segment of interest. There is a
large amount of research just on finding separations in actions.
If, for example, HMMs are to be used and codebooks are
required to identify common actions, then having equal-length
action sequences for comparison is important. This would
require either adding several inputs to the processed sequence
or removing parts in the middle that require little attention,
like slow movements. Either way, the algorithm must become
considerably more intelligent, which takes a lot of work to
automate. In [17], three frames that describe the action are
taken from a small set. Automating this process
requires a lot of image processing to correctly identify the
starting, middle, and ending locations of a person’s stride.
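One simple way to obtain equal-length sequences is to resample every sequence to a fixed number of frames by linear interpolation. This is a swapped-in technique for illustration, not the selective adding or removing of frames described above, and the feature values and function name are hypothetical.

```python
def resample(sequence, target_length):
    """Linearly interpolate a sequence of feature tuples to target_length."""
    if target_length < 2 or len(sequence) < 2:
        raise ValueError("need at least two frames in and out")
    step = (len(sequence) - 1) / (target_length - 1)
    resampled = []
    for k in range(target_length):
        position = k * step
        low = int(position)
        high = min(low + 1, len(sequence) - 1)
        fraction = position - low
        # Blend the two neighboring frames component by component.
        resampled.append(tuple(a + fraction * (b - a)
                               for a, b in zip(sequence[low], sequence[high])))
    return resampled

# Hypothetical one-dimensional features for a three-frame action.
short_action = [(0.0,), (2.0,), (4.0,)]
stretched = resample(short_action, 5)
```

Uniform resampling treats every part of the action alike; dropping only slow, uninformative segments, as suggested above, needs additional logic to decide which frames matter.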

Identification of a sequence could also involve sequences that
pass on no visual information, as in [27]. The authors have
identified a way of processing the visual information to a point
where only a few values represent the sequence. The reader is
cautioned that the further the data is from its original
values, the more errors are introduced into the system.

To store past behaviors, there must be knowledge of the
behavior. In intelligent systems, this is usually gained through
learning; however, it can also be supplied through human
intervention. If parameters for known behaviors are set up
by hand, then patterns that would otherwise be found through
the learning process are often missed,
causing misclassification of actions. It is suggested that a
combination of computer learning and human interaction
be used. This requires heavy analysis of the training and
testing data to completely encompass the range of each
activity being performed, a large data set to perform several
iterations of training and testing, and in-depth
knowledge of the minor facets of the action. Finally, once the
user has concluded that the training data is complete and the
baseline action sequences are stored, retrieving them
takes nothing more than a comparison of a newly processed
action with the most likely stored candidate.

IX.  Conclusion 

Much of the human brain is set aside for processing the
visual sense. As computing power continues to increase,
and as an ever greater push is made for efficiency in business and
government, letting automated computers perform what have
until now been human visual tasks could lead to great
efficiencies and improvements. Visual Human Intent Analysis
(VHIA) is a wide-open field of research with several different
methods that relate to individual solutions, like identifying
hand gestures, understanding different tennis strokes,
identifying office activities, and many others. Each method
described above plays on the solution’s strengths and weaknesses,
whether it is simplified classification with heavy pre-processing
or a more intelligent decision system. The wide range of methods
demonstrates the openness to new and innovative solutions
that are tailored to one’s own problem.

However, across all the different techniques there are four
common tasks which every VHIA system must perform:
detect and segment objects in each frame, determine the relative
position and orientation of each object, identify a meaningful
sequence of frames from the visual input, and store and
retrieve past sequences of behaviors for identification of
current ones. These tasks are performed in different ways
depending on the type of classification system. They require a
lot of image processing, analysis of the input data, and
in-depth knowledge of the actions with respect to the processed
data.